EAST Search History 



Ref # | Hits | Search Query | DBs | Default Operator | Plurals | Time Stamp
L1 | 15549 | 709/707 709/218 709/204 705/14 | US-PGPUB; USPAT; USOCR; FPRS; EPO; JPO; IBM_TDB | OR | ON | 2007/11/30 14:15
L2 | 105 | craw$4 and (error near10 URL) | US-PGPUB; USPAT; USOCR; FPRS; EPO; JPO; IBM_TDB | OR | ON | 2007/11/30 14:15
L3 | 10 | L1 and L2 | US-PGPUB; USPAT; USOCR; FPRS; EPO; JPO; IBM_TDB | OR | ON | 2007/11/30 14:15
S1 | 63 | web adj crawling | USPAT | OR | OFF | 2005/04/14 09:37
S2 | 55 | S1 and URL | USPAT | OR | OFF | 2005/04/13 14:28
S3 | 14 | "5999929" | USPAT | OR | OFF | 2005/04/13 10:28
S4 | 48 | "5974572" | USPAT | OR | OFF | 2005/04/13 10:28
S5 | 47 | ("5974572").URPN. | USPAT | OR | OFF | 2005/04/13 10:30
S6 | 9 | "6253204" | USPAT | OR | OFF | 2005/04/13 10:56
S7 | 1 | "20020013782" | US-PGPUB; USPAT | OR | OFF | 2005/04/13 10:56
S8 | 63 | web adj crawling | USPAT | OR | OFF | 2005/04/13 14:28
S9 | 55 | S8 and URL | USPAT | OR | OFF | 2005/04/13 14:28
S10 | 15 | S9 and script | USPAT | OR | OFF | 2005/04/13 14:28
S11 | 13 | S10 and code | USPAT | OR | OFF | 2005/04/13 14:28
S12 | 12 | S11 and web adj page | USPAT | OR | OFF | 2005/04/13 15:10
S13 | 0 | S8 and script same URL same crawl$4 | USPAT | OR | OFF | 2005/04/13 15:10
S14 | 7 | script same URL same crawl$4 | US-PGPUB; USPAT; USOCR | OR | ON | 2005/04/13 15:11
S15 | 0 | ("2004/0143787").URPN. | USPAT | OR | OFF | 2005/04/13 15:38
S16 | 153 | URL same crawl$4 | USPAT | OR | OFF | 2005/04/13 15:38
S17 | 147 | S16 and @ad<"20020619" | USPAT | OR | OFF | 2006/08/17 21:07
S18 | 35 | S17 and script | USPAT | OR | OFF | 2005/04/13 15:38
S19 | 1 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2005/04/14 09:37
S20 | 20 | script same crawl$4 | US-PGPUB; USPAT | OR | OFF | 2005/04/14 09:37


S21 | 16 | S20 and @ad<"20020619" | US-PGPUB; USPAT | OR | OFF | 2005/04/14 09:37
S22 | 40 | identif$4 same script same code same web | USPAT | OR | OFF | 2005/04/14 11:20
S23 | 39 | S22 and @ad<"20020619" | USPAT | OR | OFF | 2005/04/14 17:35
S24 | 0 | "68571247" | USPAT | OR | OFF | 2005/04/14 16:44
S25 | 1 | "6857124" | USPAT | OR | OFF | 2005/04/14 16:44
S26 | 40 | identif$4 same script same code same web | USPAT | OR | OFF | 2005/04/14 17:35
S27 | 39 | S26 and @ad<"20020619" | USPAT | OR | OFF | 2007/04/10 15:35
S28 | 12 | S27 and notification | USPAT | OR | OFF | 2005/04/14 18:10
S29 | 3106 | search adj engine | USPAT | OR | OFF | 2005/04/14 18:10
S30 | 5656 | search$3 same engine | USPAT | OR | OFF | 2005/04/14 18:10
S31 | 2608 | S30 and web | USPAT | OR | OFF | 2005/04/14 18:10
S32 | 21 | web same page same load$4 same URL same script | USPAT | OR | OFF | 2005/04/14 18:11
S33 | 6 | S32 and (search adj engine) | USPAT | OR | OFF | 2005/04/14 18:11
S34 | 6 | S33 and @ad<"20020619" | USPAT | OR | OFF | 2005/04/15 10:25
S35 | 4 | "6424966" | USPAT | OR | OFF | 2005/04/15 11:28
S36 | 0 | "20020052928" | USPAT | OR | OFF | 2005/04/15 11:28
S37 | 1 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2005/04/15 11:29
S38 | 1 | "10064176" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 14:57
S39 | 1 | "20020052928" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:12
S40 | 1 | "20040143787" and execution | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:40
S41 | 1297 | script$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:40
S42 | 2070 | execut$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:41


S43 | 179 | S41 same S42 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:41
S44 | 141 | S43 and (@ad<"20020619" @rlad<"20020619") | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:43
S45 | 10 | S44 and 709/218.ccls. | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:46
S46 | 131 | S44 not S45 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 15:46
S47 | 120 | S46 and web | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S48 | 1297 | script$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S49 | 2070 | execut$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S50 | 179 | S48 same S49 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S51 | 141 | S50 and (@ad<"20020619" @rlad<"20020619") | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S52 | 10 | S51 and 709/218.ccls. | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50


S53 | 131 | S51 not S52 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S54 | 120 | S53 and web | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S55 | 120 | S54 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:50
S56 | 3 | S55 and crawler | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2005/12/28 17:51
S57 | 1 | "20020052928" | USPAT | OR | OFF | 2006/08/17 21:07
S58 | 0 | "20020147637" | USPAT | OR | OFF | 2006/08/17 21:07
S59 | 1 | "20020147637" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:07
S60 | 14485 | crawler$3 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:07
S61 | 12841635 | @rlad<"20020619" @ad<"20020619" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:08
S62 | 10176 | S61 and S60 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:08
S63 | 716 | S62 and URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:08


S64 | 678 | URL same crawl$3 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:12
S65 | 454 | S64 and S61 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:13
S66 | 124 | S64 and S61 and spider$1 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/17 21:13
S67 | 2 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S68 | 9713 | 709/224 | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S69 | 29 | 707/14 | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S70 | 87 | web adj crawling | USPAT | OR | OFF | 2006/08/18 17:03
S71 | 77 | S70 and URL | USPAT | OR | OFF | 2006/08/18 17:03
S72 | 21 | "5999929" | USPAT | OR | OFF | 2006/08/18 17:03
S73 | 67 | "5974572" | USPAT | OR | OFF | 2006/08/18 17:03
S74 | 66 | ("5974572").URPN. | USPAT | OR | OFF | 2006/08/18 17:03
S75 | 10 | "6253204" | USPAT | OR | OFF | 2006/08/18 17:03
S76 | 1 | "20020013782" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S77 | 87 | web adj crawling | USPAT | OR | OFF | 2006/08/18 17:03
S78 | 77 | S77 and URL | USPAT | OR | OFF | 2006/08/18 17:03
S79 | 21 | S78 and script | USPAT | OR | OFF | 2006/08/18 17:03
S80 | 17 | S79 and code | USPAT | OR | OFF | 2006/08/18 17:03
S81 | 16 | S80 and web adj page | USPAT | OR | OFF | 2006/08/18 17:03
S82 | 0 | S77 and script same URL same crawl$4 | USPAT | OR | OFF | 2006/08/18 17:03
S83 | 18 | script same URL same crawl$4 | US-PGPUB; USPAT; USOCR | OR | ON | 2006/08/18 17:03
S84 | 0 | ("2004/0143787").URPN. | USPAT | OR | OFF | 2006/08/18 17:03


S85 | 211 | URL same crawl$4 | USPAT | OR | OFF | 2006/08/18 17:03
S86 | 193 | S85 and @ad<"20020619" | USPAT | OR | OFF | 2006/08/18 17:03
S87 | 46 | S86 and script | USPAT | OR | OFF | 2006/08/18 17:03
S88 | 2 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S89 | 39 | script same crawl$4 | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S90 | 20 | S89 and @ad<"20020619" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S91 | 62 | identif$4 same script same code same web | USPAT | OR | OFF | 2006/08/18 17:03
S92 | 58 | S91 and @ad<"20020619" | USPAT | OR | OFF | 2006/08/18 17:03
S93 | 0 | "68571247" | USPAT | OR | OFF | 2006/08/18 17:03
S94 | 1 | "6857124" | USPAT | OR | OFF | 2006/08/18 17:03
S95 | 62 | identif$4 same script same code same web | USPAT | OR | OFF | 2006/08/18 17:03
S96 | 58 | S95 and @ad<"20020619" | USPAT | OR | OFF | 2006/08/18 17:03
S97 | 16 | S96 and notification | USPAT | OR | OFF | 2006/08/18 17:03
S98 | 4235 | search adj engine | USPAT | OR | OFF | 2006/08/18 17:03
S99 | 7347 | search$3 same engine | USPAT | OR | OFF | 2006/08/18 17:03
S100 | 3642 | S99 and web | USPAT | OR | OFF | 2006/08/18 17:03
S101 | 29 | web same page same load$4 same URL same script | USPAT | OR | OFF | 2006/08/18 17:03
S102 | 8 | S101 and (search adj engine) | USPAT | OR | OFF | 2006/08/18 17:03
S103 | 6 | S102 and @ad<"20020619" | USPAT | OR | OFF | 2006/08/18 17:03
S104 | 7 | "6424966" | USPAT | OR | OFF | 2006/08/18 17:03
S105 | 1 | "20020052928" | USPAT | OR | OFF | 2006/08/18 17:03
S106 | 2 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S107 | 1 | "10064176" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S108 | 2 | "20020052928" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03


S109 | 1 | "20040143787" and execution | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S110 | 1487 | script$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S111 | 2379 | execut$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S112 | 204 | S110 same S111 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S113 | 157 | S112 and (@ad<"20020619" @rlad<"20020619") | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S114 | 11 | S113 and 709/218.ccls. | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S115 | 146 | S113 not S114 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S116 | 135 | S115 and web | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S117 | 1487 | script$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S118 | 2379 | execut$3 near10 URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03


S119 | 204 | S117 same S118 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S120 | 157 | S119 and (@ad<"20020619" @rlad<"20020619") | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S121 | 11 | S120 and 709/218.ccls. | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S122 | 146 | S120 not S121 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S123 | 135 | S122 and web | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S124 | 135 | S123 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S125 | 3 | S124 and crawler | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S126 | 1 | "20020052928" | USPAT | OR | OFF | 2006/08/18 17:03
S127 | 0 | "20020147637" | USPAT | OR | OFF | 2006/08/18 17:03
S128 | 1 | "20020147637" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S129 | 14485 | crawler$3 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03


S130 | 12841635 | @rlad<"20020619" @ad<"20020619" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S131 | 10176 | S130 and S129 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S132 | 716 | S131 and URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S133 | 678 | URL same crawl$3 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S134 | 454 | S133 and S130 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S135 | 124 | S133 and S130 and spider$1 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2006/08/18 17:03
S136 | 2 | "20020052928" | US-PGPUB; USPAT | OR | OFF | 2006/08/18 17:03
S137 | 1 | "20020019851" | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:34
S138 | 51 | (script near2 code) with URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:37
S139 | 5 | S138 and (spider crawl$2) | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:37


S140 | 722910 | @rlad<"20020619" and @ad<"20020619" | USPAT | OR | OFF | 2007/04/10 15:35
S141 | 0 | S140 and S139 | USPAT | OR | OFF | 2007/04/10 15:35
S142 | 857194 | @rlad<"20020619" and @ad<"20020619" | US-PGPUB; USPAT; USOCR; FPRS; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:36
S143 | 0 | S142 and S139 | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:36
S144 | 153 | (script near2 code) same URL | US-PGPUB; USPAT; USOCR; EPO; JPO; IBM_TDB | OR | ON | 2007/04/10 15:37


S14

1 Web crawling and measurement: Efficient URL caching for world wide web crawling
Andrei Z. Broder, Marc Najork, Janet L. Wiener

May 2003 Proceedings of the 12th international conference on World Wide Web
WWW '03

Publisher: ACM Press

Full text available: pdf(174.37 KB) Additional Information: full citation, abstract, references, citings, index terms

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse 
it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, 
the size of the web (estimated at over 4 billion pages) and its rate of change (estimated 
at 7% per week) move this plan from a trivial programming exercise to a serious 
algorithmic and system design challenge. Indeed, these two factors alone imply that for a 
reasonably fresh and complete crawl of the web, step (a) ... 

Keywords: URL caching, caching, crawling, distributed crawlers, web crawlers, web 
graph models 
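
The three-step loop described in the abstract above, (a) fetch, (b) parse out links, (c) repeat for unseen URLs, can be sketched in a few lines. This is only an illustrative sketch, not the method of the cited paper: the `crawl` function, the `fetch` callable, the `LinkExtractor` class, and the in-memory `PAGES` dictionary are all hypothetical names introduced here, and a plain Python set stands in for the URL cache the paper studies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}                # URLs already discovered (the test in step c)
    frontier = deque([seed])     # URLs discovered but not yet fetched
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                      # (a) fetch a page
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)                      # (b) parse it to extract links
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:           # (c) follow only unseen URLs
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Tiny in-memory "web" (hypothetical) so the sketch runs without a network.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A again</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

At web scale the `seen` set is the part that stops being trivial, since it has to answer a membership test for every extracted link; that is the cost the abstract points to when it calls the problem a system design challenge.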



2 Web engineering: Evaluation of crawling policies for a web-repository crawler
Frank McCown, Michael L. Nelson

August 2006 Proceedings of the seventeenth conference on Hypertext and
hypermedia HYPERTEXT '06

Publisher: ACM Press

Full text available: pdf(482.40 KB) Additional Information: full citation, abstract, references, citings, index terms

We have developed a web-repository crawler that is used for reconstructing websites 
when backups are unavailable. Our crawler retrieves web resources from the Internet 
Archive, Google, Yahoo and MSN. We examine the challenges of crawling web 
repositories, and we discuss strategies for overcoming some of these obstacles. We 
propose three crawling policies which can be used to reconstruct websites. We evaluate 
the effectiveness of the policies by reconstructing 24 websites and comparing the 
result ... 

Keywords: crawler policy, digital preservation, search engine, website reconstruction 



3 LSCrawler: A Framework for an Enhanced Focused Web Crawler Based on Link Semantics
M. Yuvarani, N. Ch. S. N. Iyengar, A. Kannan

December 2006 Proceedings of the 2006 IEEE/WIC/ACM International Conference on
Web Intelligence WI '06

Publisher: IEEE Computer Society

Full text available: pdf(191.22 KB) Additional Information: full citation, abstract, index terms

The traditional process of focused web crawler is to harvest a collection of web documents 
that are focused on the topical subspaces. The intricacy of focused crawlers is identifying 
the next most important and relevant link to follow. Focused Crawlers mostly rely on 
probabilistic models for predicting the relevancy of the documents. The Web documents 
are well characterized by the hypertext and the hypertext can be used to determine the 
relevance of the document to the search domain. The semanti ... 

4 On the design of a learning crawler for topical resource discovery
Charu C. Aggarwal, Fatima Al-Garawi, Philip S. Yu

July 2001 ACM Transactions on Information Systems (TOIS), volume 19 issue 3

Publisher: ACM Press

Full text available: pdf(324.39 KB) Additional Information: full citation, abstract, references, citings, index terms

In recent years, the World Wide Web has shown enormous growth in size. Vast 
repositories of information are available on practically every possible topic. In such cases, 
it is valuable to perform topical resource discovery effectively. Consequently, several new 
ideas have been proposed in recent years; among them a key technique is focused 
crawling which is able to crawl particular topical portions of the World Wide Web quickly, 
without having to explore all web pages. In this paper, we propose ... 

Keywords: Crawling, World Wide Web 



5 Evaluating topic-driven web crawlers
Filippo Menczer, Gautam Pant, Padmini Srinivasan, Miguel E. Ruiz

September 2001 Proceedings of the 24th annual international ACM SIGIR conference
on Research and development in information retrieval SIGIR '01

Publisher: ACM Press

Full text available: pdf(210.09 KB) Additional Information: full citation, abstract, references, citings, index terms

Due to limited bandwidth, storage, and computational resources, and to the dynamic 
nature of the Web, search engines cannot index every Web page, and even the covered 
portion of the Web cannot be monitored continuously for changes. Therefore it is essential 
to develop effective crawling strategies to prioritize the pages to be indexed. The issue is 
even more important for topic-specific search engines, where crawlers must make 
additional decisions based on the relevance of visited pages. ... 

Keywords: InfoSpiders, PageRank, Web information retrieval, best-first search, focused 
crawlers, performance metrics, topic driven crawling 



6 Intelligent crawling on the World Wide Web with arbitrary predicates
Charu C. Aggarwal, Fatima Al-Garawi, Philip S. Yu

April 2001 Proceedings of the 10th international conference on World Wide Web
WWW '01

Publisher: ACM Press

Full text available: pdf(272.60 KB) Additional Information: full citation, references, citings, index terms



7 Web resource crawling and searching: Lazy preservation: reconstructing websites by
crawling the crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson

November 2006 Proceedings of the 8th annual ACM international workshop on Web
information and data management WIDM '06

Publisher: ACM

Full text available: pdf(720.52 KB) Additional Information: full citation, abstract, references, index terms

Backup of websites is often not considered until after a catastrophic event has occurred to 
either the website or its webmaster. We introduce "lazy preservation": digital
preservation performed as a result of the normal operation of web crawlers and caches. 
Lazy preservation is especially suitable for third parties; for example, a teacher 
reconstructing a missing website used in previous classes. We evaluate the effectiveness 
of lazy preservation by reconstructing 24 websites of varying sizes ... 

Keywords: cached resources, digital preservation, recovery, search engine 



8 Tracking the changes of dynamic web pages in the existence of URL rewriting
Ping-Jer Yeh, Jie-Tsung Li, Shyan-Ming Yuan

November 2006 Proceedings of the fifth Australasian conference on Data mining and
analystics - Volume 61 AusDM '06

Publisher: Australian Computer Society, Inc.

Full text available: pdf(402.54 KB) Additional Information: full citation, abstract, references

Crawlers in a knowledge management system need to collect and archive documents from 
websites, and also track the change status of these documents. However, the existence of 
URL rewriting mechanism raises a page tracking problem since the URLs of a pair of 
dynamic page instances obtained during different sessions will no longer be the same. 
This paper proposes a series of algorithms in a bottom-up manner to find the 
corresponding pairs of dynamic page instances, and then to judge the change s ... 

Keywords: HTTP session, URL rewriting, crawler, string matching 



9 Crawler-Friendly Web Servers
A. Onn Brandman, Junghoo Cho, Hector Garcia-Molina, Narayanan Shivakumar

September 2000 ACM SIGMETRICS Performance Evaluation Review, volume 28 issue 2

Publisher: ACM Press

Full text available: pdf(513.04 KB) Additional Information: full citation, abstract, citings, index terms

In this paper we study how to make web servers (e.g., Apache) more crawler friendly. 
Current web servers offer the same interface to crawlers and regular web surfers, even 
though crawlers and surfers have very different performance requirements. We evaluate 
simple and easy-to-incorporate modifications to web servers so that there are significant 
bandwidth savings. Specifically, we propose that web servers export meta-data archives
describing their content.

10 Crawling: Parallel crawlers
Junghoo Cho, Hector Garcia-Molina

May 2002 Proceedings of the 11th international conference on World Wide Web
WWW '02

Publisher: ACM Press

Full text available: pdf(230.70 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper we study how we can design an effective parallel crawler. As the size of the 
Web grows, it becomes imperative to parallelize a crawling process, in order to finish 
downloading pages in a reasonable amount of time. We first propose multiple 
architectures for a parallel crawler and identify fundamental issues related to parallel 
crawling. Based on this understanding, we then propose metrics to evaluate a parallel 
crawler, and compare the proposed architectures using 40 million pages ... 

Keywords: parallelization, web crawler, web spider 



11 Web retrieval II (IR): Designing clustering-based web crawling policies for search
engine crawlers
Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

November 2007 Proceedings of the sixteenth ACM conference on Conference on
information and knowledge management CIKM '07

Publisher: ACM

Full text available: pdf(270.20 KB) Additional Information: full citation, abstract, references, index terms

The World Wide Web is growing and changing at an astonishing rate. Web information 
systems such as search engines have to keep up with the growth and change of the Web. 
Due to resource constraints, search engines usually have difficulties keeping the local 
database completely synchronized with the Web. In this paper, we study how to make
good use of the limited system resource and detect as many changes as possible. 
Towards this goal, a crawler for the Web search engine should be able to predi ... 

Keywords: clustering, incremental crawler, refresh policy, sampling, web search engine 



12 Web 2: Structure-driven crawler generation by example
Marcio L. A. Vidal, Altigran S. da Silva, Edleno S. de Moura, Joao M. B. Cavalcanti

August 2006 Proceedings of the 29th annual international ACM SIGIR conference on
Research and development in information retrieval SIGIR '06

Publisher: ACM Press

Full text available: pdf(639.17 KB) Additional Information: full citation, abstract, references, index terms

Many Web IR and Digital Library applications require a crawling process to collect pages 
with the ultimate goal of taking advantage of useful information available on Web sites. 
For some of these applications the criteria to determine when a page is to be present in a 
collection are related to the page content. However, there are situations in which the 
inner structure of the pages provides a better criteria to guide the crawling process than 
their content. In this paper, we present a structure- ... 

Keywords: digital libraries, tree edit distance, web crawlers 



13 Characteristics of streaming media stored on the Web
Mingzhe Li, Mark Claypool, Robert Kinicki, James Nichols

November 2005 ACM Transactions on Internet Technology (TOIT), volume 5 issue 4

Publisher: ACM Press

Full text available: pdf(936.68 KB) Additional Information: full citation, abstract, references, citings, index terms

Despite the growth in multimedia, there have been few studies that focus on
characterizing streaming audio and video stored on the Web. This investigation used a
customized Web crawler to traverse 17 million Web pages from diverse geographic 
locations and identify nearly 30,000 streaming audio and video clips available for analysis. 
Using custom-built extraction tools, these streaming media objects were analyzed to 
determine attributes such as media type, encoding format, playout duration, bitra ... 

Keywords: Apple QuickTime, Microsoft Windows Media Player, RealNetworks RealPlayer,
long-tailed, multimedia, self-similarity, streaming



14 Web search 1: Topic-oriented collaborative crawling
Chiasen Chung, Charles L. A. Clarke

November 2002 Proceedings of the eleventh international conference on Information
and knowledge management CIKM '02

Publisher: ACM Press

Full text available: pdf(179.28 KB) Additional Information: full citation, abstract, references, citings, index terms

A major concern in the implementation of a distributed Web crawler is the choice of a 
strategy for partitioning the Web among the nodes in the system. Our goal in selecting 
this strategy is to minimize the overlap between the activities of individual nodes. We 
propose a topic-oriented approach, in which the Web is partitioned into general subject 
areas with a crawler assigned to each. We examine design alternatives for a topic- 
oriented distributed crawler, including the creation of a Web page cl ... 

Keywords: distributed systems, text categorization, web crawling 



15 I/O-conscious data preparation for large-scale web search engines
Maxim Lifantsev, Tzi-cker Chiueh

August 2002 Proceedings of the 28th international conference on Very Large Data
Bases - Volume 28 VLDB '2002

Publisher: VLDB Endowment

Full text available: pdf(292.60 KB) Additional Information: full citation, abstract, references, index terms

Given that commercial search engines cover billions of web pages, efficiently managing 
the corresponding volumes of disk-resident data needed to answer user queries quickly is 
a formidable data manipulation challenge. We present a general technique for efficiently 
carrying out large sets of simple transformation or querying operations over external- 
memory data tables. It greatly reduces the number of performed disk accesses and seeks 
by maximizing the temporal locality of data access and orga ... 

16 Models/measurements of traffic/web systems: Web graph analyzer tool
Konstantin Avrachenkov, Danil Nemirovsky, Natalia Osipova

October 2006 Proceedings of the 1st international conference on Performance
evaluation methodolgies and tools valuetools '06

Publisher: ACM Press

Full text available: pdf(160.03 KB) Additional Information: full citation, abstract, references, index terms

We present the software tool "Web Graph Analyzer". This tool is designed to perform a 
comprehensive analysis of the Web Graph structure. By Web Graph we mean a graph 
whose vertices are Web pages and whose edges are hyper-links. With the help of the Web 
Graph Analyzer we can study the local graph characteristics such as numbers and sets of 
incoming/outgoing links to/from a given page, the page level relative to a given root 
page, and the global graph characteristics such as PageRank, Giant Strong ... 

Keywords: PageRank, World Wide Web (WWW), connectivity, crawler, graph theory,



17 Posters: Distributed location aware web crawling
Odysseas Papapetrou, George Samaras

May 2004 Proceedings of the 13th international World Wide Web conference on
Alternate track papers & posters WWW Alt. '04

Publisher: ACM Press

Full text available: pdf(29.33 KB) Additional Information: full citation, abstract, references, index terms

Distributed crawling has shown that it can overcome important limitations of today's
crawling paradigm. However, the optimal benefits of this approach are usually limited to 
the sites hosting the crawler. In this work, we propose a location-aware method, called 
IPMicra, that utilizes an IP address hierarchy, and allows crawling of links in a near 
optimal location aware manner. 

Keywords: distributed web crawling, location aware web crawling 



18 Crawling the web: Building domain-specific web collections for scientific digital
libraries; a meta-search enhanced focused crawling method
Jialun Qin, Yilu Zhou, Michael Chau

June 2004 Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
JCDL '04

Publisher: ACM Press

Full text available: pdf(214.74 KB) Additional Information: full citation, abstract, references, citings, index terms

Collecting domain-specific documents from the Web using focused crawlers has been 
considered one of the most important strategies to build digital libraries that serve the 
scientific community. However, because most focused crawlers use local search 
algorithms to traverse the Web space, they could be easily trapped within a limited sub- 
graph of the Web that surrounds the starting URLs and build domain-specific collections 
that are not comprehensive and diverse enough to scientists and researcher ... 

Keywords: digital libraries, domain-specific collection building, focused crawling, meta- 
search, web search algorithm 



19 Web ranking and classification: Efficient, automatic web resource harvesting
Michael L. Nelson, Joan A. Smith, Ignacio Garcia del Campo

November 2006 Proceedings of the 8th annual ACM international workshop on Web
information and data management WIDM '06

Publisher: ACM

Full text available: pdf(703.63 KB) Additional Information: full citation, abstract, references, index terms

There are two problems associated with conventional web crawling techniques: a crawler 
cannot know if all resources at a non-trivial web site have been discovered and crawled 
("the counting problem") and the human-readable format of the resources are not always 
suitable for machine processing ("the representation problem"). We introduce an 
approach that solves these two problems by implementing support for both the Open 
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 D ... 

Keywords: OAI-PMH, mod_oai, web crawling



Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, Gary Wesley

May 2006 ACM Transactions on Internet Technology (TOIT), volume 6 issue 2 
Publisher: ACM Press 

Full text available: pdf(609.18 KB ) Additional Information: full citation , abstract , references , index terms 

We describe the design and performance of WebBase, a tool for Web research. The 
system includes a highly customizable crawler, a repository for collected Web pages, an 
indexer for both text and link-related page features, and a high-speed content distribution 
facility. The distribution module enables researchers world-wide to retrieve pages from 
WebBase, and stream them across the Internet at high speed. The advantage for the 
researchers is that they need not all crawl the Web before beginning t ... 

Keywords: WebBase Web crawler, distribution, hyperlink indexing, site crawling